Categorizing Document Images into Script and Language Classes

Author

C. Y. Suen

Abstract

In order to properly archive and index large numbers of international documents, several challenging processing steps must be completed even before optical character recognition (OCR) can be applied. We present a system that preclassifies documents for further processing and OCR. The system operates in four phases: preprocessing (includ... We present a set of statistical techniques, based fundamentally on connected component analysis and horizontal projections, for the first two phases. Even with little training, the system predicts the correct script category in 91% of cases when tested on real-life documents of varying kinds, formats, and qualities from many sources. The third and fourth phases are based on expert-system approaches. Language identification combines several heuristics based on a statistical analysis of our training corpus. It currently has a 95% success rate on real-life documents of moderate quality. We will discuss the techniques, their combination, and the process of improving performance.
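The abstract names connected component analysis and horizontal projections as the basis of the early phases but gives no algorithmic detail. The following is a minimal sketch of those two primitives on a binary image, not the authors' implementation. The function names, the 8-connectivity choice, the zero threshold, and the `SAMPLE` image are all assumptions made for illustration.

```python
from collections import deque


def horizontal_projection(image):
    """Count ink pixels (value 1) in each row of a binary image."""
    return [sum(row) for row in image]


def text_line_spans(profile, threshold=0):
    """Return (start, end) row ranges whose count exceeds threshold (end exclusive)."""
    spans, start = [], None
    for y, count in enumerate(profile):
        if count > threshold and start is None:
            start = y
        elif count <= threshold and start is not None:
            spans.append((start, y))
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans


def connected_components(image):
    """Return inclusive bounding boxes (top, left, bottom, right) of 8-connected ink regions."""
    height = len(image)
    width = len(image[0]) if image else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            seen[y][x] = True
            queue = deque([(y, x)])
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        inside = 0 <= ny < height and 0 <= nx < width
                        if inside and image[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Hypothetical 5x6 binary image: two blobs on one text line, one blob on a second line.
SAMPLE = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
]
```

In a sketch like this, the row profile separates text lines at empty rows, and the bounding boxes of the components give per-glyph size and position statistics that a script classifier could consume. Real scans would need binarization and skew correction first.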


Similar resources

Skew Detection, Page Segmentation, and Script Classification of Printed Document Images

Automatic processing of international documents presents a number of challenging problems because Optical Character Recognition (OCR) techniques are not available for all languages and all script classes. Document images must be categorized according to their script type first, in our case Roman, Ideographic, or Arabic. We present a set of statistical methods that first detect and correct the skew ...


Determination of the Script and Language Content of Document Images

Most document recognition work to date has been performed on English text. Because of the large overlap of the character sets found in English and major Western European languages such as French and German, some extensions of the basic English capability to those languages have taken place. However, automatic language identification prior to optical character recognition is not commonly availab...


Script and Language Identification for Document Images and Scene Texts

In recent times, there has been an increase in Optical Character Recognition (OCR) solutions for recognizing text from scanned document images and scene texts captured with mobile devices. Many of these solutions work very well for an individual script or language. But in a multilingual environment such as India, where a document image or scene image may contain more than one language, th...


Neural network based system for script identification in Indian documents

The paper describes a neural network-based script identification system which can be used in the machine reading of documents written in English, Hindi and Kannada language scripts. Script identification is a basic requirement in automation of document processing, in multi-script, multi-lingual environments. The system developed includes a feature extractor and a modular neural network. The fea...


Script Identification for Document Image Retrieval: A Survey

In recent years, advances in computer technology have led to many multimedia documents being captured and stored, and hence the demand for recognizing and retrieving such documents has increased tremendously. In such an environment, the large volume of data and variety of scripts make manual identification unworkable. In such cases the ability to automatically determine the script, and further the...



Publication date: 1998
